NSF PAR Search | NSF Public Access Repository


Efficient Algorithms for Cardinality Estimation and Conjunctive Query Evaluation With Simple Degree Constraints

https://doi.org/10.1145/3725233

Im, Sungjin; Moseley, Benjamin; Ngo, Hung; Pruhs, Kirk (June 2025, Proceedings of the ACM on Management of Data)

Cardinality estimation and conjunctive query evaluation are two of the most fundamental problems in database query processing. Recent work proposed, studied, and implemented a robust and practical information-theoretic cardinality estimation framework. In this framework, the estimator is the cardinality upper bound of a conjunctive query subject to "degree constraints", which model a rich set of input data statistics. For general degree constraints, computing this bound is computationally hard. Researchers have naturally sought efficiently computable relaxed upper bounds that are as tight as possible. The polymatroid bound is the tightest among those relaxed upper bounds. While it is an open question whether the polymatroid bound can be computed in polynomial time in general, it is known to be computable in polynomial time for some classes of degree constraints. Our focus is on a common class of degree constraints called simple degree constraints. Researchers had not previously determined how to compute the polymatroid bound in polynomial time for this class of constraints. Our first main result is a polynomial-time algorithm to compute the polymatroid bound given simple degree constraints. Our second main result is a polynomial-time algorithm to compute a "proof sequence" establishing this bound. This proof sequence can then be incorporated in the PANDA framework to give a faster algorithm to evaluate a conjunctive query. In addition, we show computational limitations to extending our results to broader classes of degree constraints. Finally, our technique leads naturally to a new relaxed upper bound called the flow bound, which is computationally tractable.
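As a hedged illustration of how a degree constraint turns into a cardinality bound, consider a small example that is not taken from the paper. The query Q(a, b, c) :- R(a, b), S(b, c), the relation names, and the statistics N and d are all assumptions: |R| ≤ N, and every value of b has at most d matching values of c in S. The second constraint conditions on a single variable, which is assumed here to be what "simple" means. For any polymatroid h satisfying these constraints:

```latex
\begin{align*}
h(abc) &= h(ab) + h(c \mid ab) && \text{(chain rule)} \\
       &\le h(ab) + h(c \mid b) && \text{(submodularity)} \\
       &\le \log N + \log d     && \text{(cardinality constraint on } R\text{, degree constraint on } S\text{)}
\end{align*}
```

Under these assumptions the bound on the output size is |Q| ≤ N · d, and the three lines above play the role of a very short proof sequence.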
Full Text Available
PANDA: Query Evaluation in Submodular Width

https://doi.org/10.46298/THEORETICS.25.12

Khamis, Mahmoud Abo; Ngo, Hung Q; Suciu, Dan (April 2025, TheoretiCS)

In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in times that are bounded by the corresponding information-theoretic bounds. In this paper, we describe PANDA, an algorithm that takes a Shannon inequality that underlies the bound, and translates each proof step into an algorithmic step corresponding to some database operation. PANDA computes answers to a conjunctive query in time given by the submodular width plus the output size of the query. The version in this paper represents a significant simplification of the original version [ANS, PODS'17]. Comment: 42 pages. This is the TheoretiCS journal version.
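To make "a Shannon inequality that underlies the bound" concrete, here is the standard textbook example for the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c). It is an illustration under assumed names and statistics (each relation has at most N tuples), not an excerpt from the paper. Each line is one application of submodularity, h(X) + h(Y) ≥ h(X ∪ Y) + h(X ∩ Y):

```latex
\begin{align*}
h(ab) + h(bc) &\ge h(abc) + h(b) \\
h(b) + h(ac)  &\ge h(abc) + h(\emptyset) = h(abc) \\
\Rightarrow\quad h(ab) + h(bc) + h(ac) &\ge 2\,h(abc)
\end{align*}
```

Since each term on the left is at most log N, this gives 2 h(abc) ≤ 3 log N, that is, |Q| ≤ N^{3/2}. Proof steps of this kind are what an algorithm in the style described above would map to database operations.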
Full Text Available
Polynomial Time Convergence of the Iterative Evaluation of Datalog^∘ Programs

https://doi.org/10.1145/3695839

Im, Sungjin; Moseley, Benjamin; Ngo, Hung Q; Pruhs, Kirk (November 2024, Proceedings of the ACM on Management of Data)

Datalog^∘ is an extension of Datalog that allows for aggregation and recursion over an arbitrary commutative semiring. Like Datalog, Datalog^∘ programs can be evaluated via the natural iterative algorithm until a fixed point is reached. However, unlike Datalog, the natural iterative evaluation of some Datalog^∘ programs over some semirings may not converge. It is known that the commutative semirings for which the iterative evaluation of Datalog^∘ programs is guaranteed to converge are exactly those semirings that are stable. Previously, the best known upper bound on the number of iterations until convergence over p-stable semirings was ∑_{i=1}^{n} (p+2)ⁱ = Θ(pⁿ) steps, where n is (essentially) the output size. We establish that, in fact, the natural iterative evaluation of a Datalog^∘ program over a p-stable semiring converges within a polynomial number of iterations. In particular, our upper bound is O(σ p n² (n² lg λ + lg σ)), where σ is the number of elements in the semiring present in either the input databases or the Datalog^∘ program, and λ is the maximum number of terms in any product in the Datalog^∘ program.
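The abstract does not restate what p-stable means. The sketch below assumes the usual definition from the related literature: an element u is p-stable when 1 ⊕ u ⊕ ... ⊕ u^p equals 1 ⊕ u ⊕ ... ⊕ u^(p+1). It checks this on an illustrative semiring whose elements are the p+1 smallest non-negative costs; the helper names `make_trop_p`, `partial_sum`, and `is_p_stable` are hypothetical and exist only for this example.

```python
from itertools import product

INF = float("inf")

def make_trop_p(p):
    """Semiring whose elements are sorted tuples of the p+1 smallest costs.
    Assumes all costs are non-negative. Returns (add, mul, zero, one)."""
    k = p + 1

    def add(x, y):
        # (+): merge both tuples, keep the k smallest costs
        return tuple(sorted(x + y)[:k])

    def mul(x, y):
        # (x): all pairwise sums, keep the k smallest costs
        return tuple(sorted(a + b for a, b in product(x, y))[:k])

    zero = (INF,) * k
    one = (0.0,) + (INF,) * p
    return add, mul, zero, one

def partial_sum(u, q, add, mul, one):
    """Compute 1 (+) u (+) u^2 (+) ... (+) u^q in the given semiring."""
    total, power = one, one
    for _ in range(q):
        power = mul(power, u)
        total = add(total, power)
    return total

def is_p_stable(u, p, add, mul, one):
    """Check the assumed definition: the partial sums up to p and p+1 agree."""
    return partial_sum(u, p, add, mul, one) == partial_sum(u, p + 1, add, mul, one)
```

With p = 0 this collapses to the ordinary (min, +) semiring, where every non-negative element is 0-stable because min(0, u) = 0.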
Full Text Available
On the Convergence Rate of Linear Datalog^∘ over Stable Semirings

Im, Sungjin; Moseley, Ben; Ngo, Hung; Pruhs, Kirk (March 2024, International Conference on Database Theory)

Full Text Available
Optimizing Nested Recursive Queries

https://doi.org/10.1145/3639271

Shaikhha, Amir; Suciu, Dan; Schleich, Maximilian; Ngo, Hung (March 2024, Proceedings of the ACM on Management of Data)

Datalog is a declarative programming language that has gained popularity in various domains due to its simplicity, expressiveness, and efficiency. But pure Datalog is limited to monotone queries, and cannot be used in most practical applications. For that reason, newer systems are relaxing the language by allowing non-monotone queries to be freely combined with recursion. But by departing from the elegant fixpoint semantics of pure Datalog, these systems often result in inefficient query execution, for example by performing redundant computations or using redundant storage. In this paper, we propose Temporel, a system that allows recursion to be freely combined with non-monotone operators. Temporel optimizes the program by compiling it into a novel intermediate representation that we call TempoDL. Our experimental results show that our system outperforms a state-of-the-art Datalog engine as well as a vectorized and a compiled in-memory database system for a wide range of applications from machine learning to graph processing.
Full Text Available
Convergence of datalog over (Pre-) Semirings

https://doi.org/10.1145/3643027

Abo Khamis, Mahmoud; Ngo, Hung Q; Pichler, Reinhard; Suciu, Dan; Wang, Yisu Remy (April 2024, Journal of the ACM)

Recursive queries have been traditionally studied in the framework of datalog, a language that restricts recursion to monotone queries over sets, which is guaranteed to converge in polynomial time in the size of the input. But modern big data systems require recursive computations beyond the Boolean space. In this article, we study the convergence of datalog when it is interpreted over an arbitrary semiring. We consider an ordered semiring, define the semantics of a datalog program as a least fixpoint in this semiring, and study the number of steps required to reach that fixpoint, if ever. We identify algebraic properties of the semiring that correspond to certain convergence properties of datalog programs. Finally, we describe a class of ordered semirings on which one can use the semi-naïve evaluation algorithm on any datalog program.
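For readers unfamiliar with semi-naïve evaluation, the following is a minimal sketch of the classic set-based version (the Boolean case) for transitive closure. It is not the generalized algorithm of the article; the program, the function name, and the edge representation are assumptions made for illustration.

```python
def seminaive_transitive_closure(edges):
    """Evaluate  T(x, y) :- E(x, y).  T(x, z) :- T(x, y), E(y, z).
    Each round joins only the facts discovered in the previous round."""
    # Index E by its first column so the join with delta is a lookup.
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    total = set(edges)   # all T facts derived so far
    delta = set(edges)   # T facts that were new in the last round
    while delta:
        derived = {(x, z) for (x, y) in delta for z in succ.get(y, ())}
        delta = derived - total   # keep only genuinely new facts
        total |= delta
    return total
```

The design point is that the join touches `delta` rather than all of `total`, so facts derived in earlier rounds are not re-derived from scratch in every iteration.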
Full Text Available
On the Convergence Rate of Linear Datalog^∘ over Stable Semirings

https://doi.org/10.4230/LIPIcs.ICDT.2024.11

Im, Sungjin; Moseley, Benjamin; Ngo, Hung; Pruhs, Kirk (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Cormode, Graham; Shekelyan, Michael (Ed.)
Datalog^∘ is an extension of Datalog, where instead of a program being a collection of unions of conjunctive queries over the standard Boolean semiring, a program may now be a collection of sum-product queries over an arbitrary commutative partially ordered pre-semiring. Datalog^∘ is more powerful than Datalog in that its additional algebraic structure allows for supporting recursion with aggregation. At the same time, Datalog^∘ retains the syntactic and semantic simplicity of Datalog: Datalog^∘ has declarative least fixpoint semantics. The least fixpoint can be found via the naïve evaluation algorithm that repeatedly applies the immediate consequence operator until no further change is possible. It was shown in [Mahmoud Abo Khamis et al., 2022] that, when the underlying semiring is p-stable, then the naïve evaluation of any Datalog^∘ program over the semiring converges in a finite number of steps. However, the upper bounds on the rate of convergence were exponential in the number n of ground IDB atoms. This paper establishes polynomial upper bounds on the convergence rate of the naïve algorithm on linear Datalog^∘ programs, which are quite common in practice. In particular, the main result of this paper is that the convergence rate of linear Datalog^∘ programs under any p-stable semiring is O(pn³). Furthermore, we show a matching lower bound by constructing a p-stable semiring and a linear Datalog^∘ program that requires Ω(pn³) iterations for the naïve iteration algorithm to converge. Next, we study the convergence rate in terms of the number of elements in the semiring for linear Datalog^∘ programs. When L is the number of elements, the convergence rate is bounded by O(pn log L). This significantly improves the convergence rate for small L. We show a nearly matching lower bound as well.
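The naïve evaluation loop described above can be sketched on one concrete linear program. The example below is an assumption chosen for illustration, not a program from the paper: single-source shortest paths, D(y) = [y = source] ⊕ ⊕_x D(x) ⊗ E(x, y), read over the (min, +) semiring with non-negative edge weights. The function names and the graph encoding are hypothetical.

```python
INF = float("inf")

def immediate_consequence(dist, edges, source):
    """Apply the rule D(y) = [y = source] (+) (+)_x D(x) (x) E(x, y) once,
    where (+) is min and (x) is numeric addition."""
    new = {v: (0.0 if v == source else INF) for v in dist}
    for (x, y), weight in edges.items():
        new[y] = min(new[y], dist[x] + weight)
    return new

def naive_evaluation(vertices, edges, source):
    """Iterate the operator from the all-zero valuation (INF in this semiring)
    until nothing changes; return the fixpoint and the number of changing steps."""
    dist = {v: INF for v in vertices}
    steps = 0
    while True:
        nxt = immediate_consequence(dist, edges, source)
        if nxt == dist:
            return dist, steps
        dist, steps = nxt, steps + 1

# Usage on a small assumed graph.
graph = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0, ("c", "d"): 1.0}
fixpoint, steps = naive_evaluation(["a", "b", "c", "d"], graph, "a")
```

The program is linear because each product contains a single IDB atom, D(x). With negative cycles the loop would not terminate, which is why the sketch assumes non-negative weights.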
Full Text Available
DAO²: Overcoming Overall Storage Overflow in Intermittently Connected Sensor Networks

https://doi.org/10.1109/TNET.2023.3273553

Tang, Bin; Ngo, Hung; Ma, Yan; Alhakami, Basil (May 2023, IEEE/ACM Transactions on Networking)

Many emerging sensor network applications operate in challenging environments wherein the base station is unavailable. Data generated from such intermittently connected sensor networks (ICSNs) must be stored inside the network for some unpredictable time before uploading opportunities become available. Consequently, sensory data could overflow the limited storage capacity available in the entire network, making discarding valuable data inevitable. To overcome such overall storage overflow in ICSNs, we propose and study a new algorithmic framework called data aggregation for overall storage overflow (DAO²). Utilizing spatial data correlation that commonly exists among sensory data, DAO² employs data aggregation techniques to reduce the overflow data size while minimizing the total energy consumption in data aggregation. At the core of our framework are two new graph theoretical problems that have not been studied. We refer to them as the traveling salesmen placement problem (TSP²) and the quota traveling salesmen placement problem (Q-TSP²). Different from the well-known multiple traveling salesman problem (mTSP) and its variants, which mainly focus on the routing of multiple salesmen initially located at fixed locations, TSP² and Q-TSP² must decide the placement as well as the routing of the traveling salesmen. We prove that both problems are NP-hard and design approximation, heuristic, and distributed algorithms. Our algorithms outperform the state-of-the-art data aggregation work with base stations by up to 71.8% in energy consumption.
Full Text Available
Datalog in Wonderland

https://doi.org/10.1145/3552490.3552492

Khamis, Mahmoud Abo; Ngo, Hung Q.; Pichler, Reinhard; Suciu, Dan; Wang, Yisu Remy (July 2022, ACM SIGMOD Record)

Modern data analytics applications, such as knowledge graph reasoning and machine learning, typically involve recursion through aggregation. Such computations pose great challenges to both system builders and theoreticians: first, to derive simple yet powerful abstractions for these computations; second, to define and study the semantics for the abstractions; third, to devise optimization techniques for these computations. In recent work we presented a generalization of Datalog called Datalog^∘, which addresses these challenges. Datalog^∘ is a simple abstraction, which allows aggregates to be interleaved with recursion, and retains much of the simplicity and elegance of Datalog. We define its formal semantics based on an algebraic structure called Partially Ordered Pre-Semirings, and illustrate through several examples how Datalog^∘ can be used for a variety of applications. Finally, we describe a new optimization rule for Datalog^∘, called the FGH-rule, then illustrate the FGH-rule on several examples, including a simple magic-set rewriting, generalized semi-naïve evaluation, and a bill-of-material example, and briefly discuss the implementation of the FGH-rule and present some experimental validation of its effectiveness.
Full Text Available
Convergence of Datalog over (Pre-) Semirings

https://doi.org/10.1145/3517804.3524140

Abo Khamis, Mahmoud; Ngo, Hung Q.; Pichler, Reinhard; Suciu, Dan; Wang, Yisu Remy (June 2022, PODS)

Full Text Available
